This notebook explores the data that was collected and uploaded to AWS in other notebooks. It uses the eda module built into the baseball package. Some of the visualizations produced here are included in the final report; the rest are additional explorations.
Part of Baseball Umpire Analysis Package for UMSI SIADS-591 & 592 Milestone I by Anthony Giove (agiove@umich.edu), Avinash Reddy (avimads@umich.edu), and Ryan Maley (rjmaley@umich.edu).
# importing the baseball package and our aws credentials. when baseball is imported,
# the imports necessary for any module within baseball are passed into this workspace.
from baseball import *
import aws
All of our data is housed on AWS. Utilizing ETL we created a pitches_expanded table which contains relevant content from many different tables that we created while parsing data.
# sql statement needed to pull data we want to explore
baseSql = '''SELECT pitch_type, stand, p_throws, type, hit_location, bb_type,balls,strikes,pfx_x,pfx_z, inning,
description, game_type, plate_x, plate_z, game_year, daynight, on_3b,on_2b, on_1b, outs_when_up, inning_topbot,
sz_bot, launch_speed, launch_angle, pitch_number, at_bat_number, bat_score, fld_score,
umpname, pitcher_name, height_in FROM pitches_expanded;'''
# using the sql module within baseball, we can directly create a dataframe while running our sql query
pitch = sql.createDF(sql.connect(aws.creds), baseSql)
pitch
Running the cell below creates the same dataframe as the AWS query, without the need for credentials.
pitch = pd.read_csv('pitches_expanded.csv')
Now that we have the data needed for exploration we will run some functions from the eda module to prepare the data
# dataPrep bins the plate_x and plate_z columns so that comparisons and groupbys can be performed
pitch = eda.dataPrep(pitch)
pitch
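To make the binning step concrete, here is a minimal sketch of how plate_x and plate_z could be binned with pandas. It is not the code inside eda.dataPrep: the bin edges and the x_bin and z_bin column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy pitch locations in feet; plate_x and plate_z mirror the columns in the query above.
toy = pd.DataFrame({
    "plate_x": [-1.20, -0.30, 0.05, 0.80],
    "plate_z": [1.40, 2.10, 2.95, 3.60],
})

# Hypothetical bin edges: the real edges used by eda.dataPrep are not shown in this notebook.
x_edges = np.linspace(-2.0, 2.0, 9)   # 8 bins, each 0.5 ft wide
z_edges = np.linspace(0.0, 5.0, 11)   # 10 bins, each 0.5 ft tall

# labels=False returns the integer bin index instead of an Interval label.
toy["x_bin"] = pd.cut(toy["plate_x"], bins=x_edges, labels=False)
toy["z_bin"] = pd.cut(toy["plate_z"], bins=z_edges, labels=False)
print(toy)
```

Once every pitch carries a pair of bin indices, pitches from different umpires can be grouped and compared cell by cell.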
# the umpireFilter function from eda allows us to specify a cutoff (limit) for the number of strikes
# the umpire has called within the collected data. For this notebook run, the threshold is 9,500
umps = eda.umpireFilter(pitch, limit=9500)
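As a rough illustration of the filtering idea, the sketch below keeps only umpires whose strike count reaches a threshold. The helper name umpire_filter is hypothetical, and two details are assumptions: that type == "S" marks a strike, and that the cutoff is inclusive. The actual eda.umpireFilter may differ on both.

```python
import pandas as pd

# Toy data; umpname and type mirror columns in the query above.
# Assumption: type == "S" marks a strike.
toy = pd.DataFrame({
    "umpname": ["A", "A", "A", "B", "B", "C"],
    "type":    ["S", "S", "B", "S", "B", "S"],
})

def umpire_filter(df, limit):
    """Return umpire names with at least `limit` strikes (hypothetical helper)."""
    strikes = df.loc[df["type"] == "S", "umpname"].value_counts()
    return sorted(strikes[strikes >= limit].index)

print(umpire_filter(toy, limit=2))
```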
We ultimately went in a different direction, but these were two approaches we considered. The convex hull visual shows the outer boundary of what umpires deemed a strike, while the flat strike zone shows the bounds as a rectangle built from the highest and lowest values on both axes.
# initial exploration visualization 1
eda.convexHull(pitch, umps)
# initial exploration visualization 2.
eda.flatStrikeZone(pitch)
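The two boundary ideas can be sketched in plain Python. This is not the code behind eda.convexHull or eda.flatStrikeZone; it uses Andrew's monotone chain algorithm for the hull and a simple min/max rectangle, applied to made-up strike locations.

```python
def bounding_box(points):
    """Axis-aligned rectangle (min_x, max_x, min_z, max_z) around the points."""
    xs = [p[0] for p in points]
    zs = [p[1] for p in points]
    return min(xs), max(xs), min(zs), max(zs)

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when the path o -> a -> b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other, so drop it.
    return lower[:-1] + upper[:-1]

# Toy called-strike locations (plate_x, plate_z) in feet; made-up values.
strikes = [(-0.8, 2.0), (0.0, 1.5), (0.8, 2.0), (0.0, 3.5), (0.0, 2.5)]
print(bounding_box(strikes))
print(convex_hull(strikes))
```

The rectangle always contains the hull, so it overstates the zone wherever the called strikes do not reach the corners.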
This section details additional cleaning done to prepare for the final visualizations and report generation. Using the eda module, we create dictionaries to house our dataframes for raw pitch data (ump_dfs, umpAll_dfs), groupby data (ump_dfsG, umpAll_dfsG) and comparison groupby data (ump_dfsC, umpAll_dfsC). Because these dictionaries share the same keys, we can iterate through several of them at once and quickly create visualizations.
# creating the raw data dictionaries for individual umps and all umps
ump_dfs = eda.dfUmpDict(pitch, umps)
umpAll_dfs = eda.dfUmpDict(pitch)
# printing the keys from both dictionaries. within each individual ump dictionary,
# the keys shown by umpAll_dfs.keys() are nested
print(ump_dfs.keys(), umpAll_dfs.keys())
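The nested layout can be pictured with a small sketch. The helper scenario_dict is hypothetical and covers only a few scenarios; the inning thresholds follow the plot titles used later in this notebook, but the real eda.dfUmpDict may build its subsets differently.

```python
import pandas as pd

# Toy pitches; umpname, stand and inning mirror columns in the query above.
toy = pd.DataFrame({
    "umpname": ["A", "A", "B", "B"],
    "stand":   ["L", "R", "L", "R"],
    "inning":  [1, 9, 2, 5],
})

def scenario_dict(df):
    """Split one dataframe into named scenario subsets (hypothetical helper)."""
    return {
        "Overall": df,
        "StandL": df[df["stand"] == "L"],
        "StandR": df[df["stand"] == "R"],
        "Begin": df[df["inning"] <= 2],
        "End": df[df["inning"] >= 8],
    }

# One flat dictionary for all umpires, one nested dictionary keyed by umpire.
all_umps = scenario_dict(toy)
per_ump = {name: scenario_dict(group) for name, group in toy.groupby("umpname")}

print(list(all_umps))
print(list(per_ump), list(per_ump["A"]))
```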
# grouping the raw data based on x and z bins for all umpires
umpAll_dfsG = {}
for key, val in umpAll_dfs.items():
    umpAll_dfsG[key] = eda.zoneBinGroupby(val)
# grouping the raw data based on x and z bins for every individual umpire
ump_dfsG = {}
for ump, v in ump_dfs.items():
    ump_dfsG[ump] = {}
    for key, val in v.items():
        ump_dfsG[ump][key] = eda.zoneBinGroupby(val)
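For intuition, grouping by zone bins might look like the sketch below. It is an assumption about the general shape of the operation, not the body of eda.zoneBinGroupby; the column names x_bin, z_bin and is_strike are hypothetical.

```python
import pandas as pd

# Toy binned pitches; is_strike is 1 for a called strike and 0 otherwise.
toy = pd.DataFrame({
    "x_bin":     [0, 0, 0, 1, 1],
    "z_bin":     [0, 0, 1, 0, 0],
    "is_strike": [1, 0, 1, 1, 1],
})

# One row per (x_bin, z_bin) cell: the share of strikes and the number of pitches.
grouped = (
    toy.groupby(["x_bin", "z_bin"])["is_strike"]
       .agg(strike_rate="mean", pitches="count")
       .reset_index()
)
print(grouped)
```

Keeping the pitch count next to the rate makes it easy to spot cells where a rate rests on very few pitches.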
# creating dictionary of dfs for every individual umpire compared with
# all umpires for each condition we want to investigate
ump_dfsC = {}
for ump, v in ump_dfsG.items():
    ump_dfsC[ump] = {}
    for key, val in v.items():
        ump_dfsC[ump][key] = eda.compareDF(val, umpAll_dfsG[key])
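A comparison of two grouped dataframes can be sketched as a per-bin subtraction. The helper compare and its column names are hypothetical; eda.compareDF may use a different metric (for example a ratio) or different alignment rules.

```python
import pandas as pd

# Per-bin strike rates for one umpire and for all umpires (made-up numbers).
ump = pd.DataFrame({"x_bin": [0, 0, 1], "z_bin": [0, 1, 0],
                    "strike_rate": [0.50, 0.75, 1.00]})
league = pd.DataFrame({"x_bin": [0, 0, 1], "z_bin": [0, 1, 0],
                       "strike_rate": [0.25, 0.75, 0.50]})

def compare(left, right, keys=("x_bin", "z_bin")):
    """Per-bin difference left - right, aligned on the bin keys (hypothetical helper)."""
    merged = left.merge(right, on=list(keys), suffixes=("_left", "_right"))
    merged["diff"] = merged["strike_rate_left"] - merged["strike_rate_right"]
    return merged[list(keys) + ["diff"]]

print(compare(ump, league))
```

Merging on the bin keys rather than relying on row order keeps the subtraction correct even if the two dataframes list their bins in a different order.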
Now we have all of our prepped dataframes within dictionaries. Using the eda module's zoneHexViz function, we can pass in different dataframes to visualize different scenarios and conditions. The function accepts a diff argument (diff2 = True in the calls below) that changes the second image so that it shows a difference between two datasets instead of the raw data.
The next two visuals look at the difference between left handed and right handed batters. As expected the strike zones shift based on the handedness of the batter. The second visual includes a comparison plot which highlights the differences.
eda.zoneHexViz(umpAll_dfs['StandL'], umpAll_dfs['StandR'],
title1 = 'All Umpires - Left Handed Batters',
title2 = 'All Umpires - Right Handed Batters', gridSize=30)
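The effect of the grid size can be explored with matplotlib's hexbin, shown here on synthetic data. Whether zoneHexViz is built on hexbin is an assumption based on its name, and the rectangular zone used to label strikes is invented for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic pitch locations in feet (not real data).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.8, 2000)
z = rng.normal(2.5, 0.8, 2000)

# Synthetic strike flag: 1 inside a made-up rectangular zone, else 0.
is_strike = ((np.abs(x) < 0.83) & (z > 1.5) & (z < 3.5)).astype(float)

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
for ax, gridsize in zip(axes, (12, 30)):
    # Each hexagon is colored by the mean of is_strike for the pitches inside it.
    hb = ax.hexbin(x, z, C=is_strike, gridsize=gridsize,
                   reduce_C_function=np.mean, cmap="Reds")
    ax.set_title(f"gridsize={gridsize}")
    fig.colorbar(hb, ax=ax, label="share of strikes")
```

A smaller grid size gives larger hexagons with more pitches in each, which smooths the picture; a larger grid size shows finer detail but leaves more cells with few or no pitches.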
dfC = eda.compareDF(umpAll_dfsG['StandR'], umpAll_dfsG['StandL'])
eda.zoneHexViz(umpAll_dfsG['StandR'], dfC, title1 = 'All Umpires - Right Handed Batters',
title2 = 'All Umpires - Right Handed Batters Compared to Leftys', diff2 = True, gridSize=18)
The next two visuals look at the difference between left handed and right handed pitchers. As expected the strike zones shift based on the handedness of the pitcher but not as significantly as it does when looking at batter handedness. The second visual includes a comparison plot which highlights the differences.
eda.zoneHexViz(umpAll_dfs['ThrowL'], umpAll_dfs['ThrowR'],
title1 = 'All Umpires - Left Handed Pitchers',
title2 = 'All Umpires - Right Handed Pitchers', gridSize=30)
dfC = eda.compareDF(umpAll_dfsG['ThrowR'], umpAll_dfsG['ThrowL'])
eda.zoneHexViz(umpAll_dfsG['ThrowR'], dfC, title1 = 'All Umpires - Right Handed Pitchers',
title2 = 'All Umpires - Right Handed Pitchers Compared to Leftys', diff2 = True, gridSize=18)
The next two visuals look at the difference between beginning and end of game. As expected the strike zones are very similar and do not show much change between beginning and end of game. The second visual includes a comparison plot which highlights the differences.
eda.zoneHexViz(umpAll_dfs['Begin'], umpAll_dfs['End'],
title1 = 'All Umpires - Beginning of Game (1st and 2nd innings)',
title2 = 'All Umpires - End of Game (8th inning and later)', gridSize=30)
dfC = eda.compareDF(umpAll_dfsG['End'], umpAll_dfsG['Begin'])
eda.zoneHexViz(umpAll_dfsG['End'], dfC, title1 = 'All Umpires - End of Game (8th inning and later)',
title2 = 'All Umpires - End of Game Compared to Beginning of Game', diff2 = True, gridSize=12)
The next two visuals look at the difference between short (5'10" and shorter) and tall (6'2" and taller) players. As expected the strike zones shift based on the height of the batter. As player height increases the strike zone rises. The second visual includes a comparison plot which highlights the differences.
eda.zoneHexViz(umpAll_dfs['Short'], umpAll_dfs['Tall'],
title1 = 'All Umpires - Batters 5ft 10in and Shorter',
title2 = 'All Umpires - Batters 6ft 2in and Taller', gridSize=12)
dfC = eda.compareDF(umpAll_dfsG['Short'], umpAll_dfsG['Tall'])
eda.zoneHexViz(umpAll_dfsG['Short'], dfC, title1 = 'All Umpires - Batters 5ft 10in and Shorter',
title2 = 'All Umpires - Short Batters Compared to Tall', diff2 = True, gridSize=10)
The next two visuals look at the difference between day and night games. As expected the strike zones are very similar with minor differences. The second visual includes a comparison plot which highlights the differences.
eda.zoneHexViz(umpAll_dfs['Day'], umpAll_dfs['Night'],
title1 = 'All Umpires - Day Games',
title2 = 'All Umpires - Night Games', gridSize=30)
dfC = eda.compareDF(umpAll_dfsG['Day'], umpAll_dfsG['Night'])
eda.zoneHexViz(umpAll_dfsG['Day'], dfC, title1 = 'All Umpires - Day Games',
title2 = 'All Umpires - Day Compared to Night Games', diff2 = True, gridSize=18)
The next two visuals look at the difference between regular season and post season games. Overall the zones are very similar, apart from a few spots with higher values that stretch the autoscaled axes. This shows that umpire strike zones don't change much between regular season and post season. The second visual includes a comparison plot which highlights the differences.
eda.zoneHexViz(umpAll_dfs['Reg'], umpAll_dfs['Post'],
title1 = 'All Umpires - Regular Season Games',
title2 = 'All Umpires - Post Season Games', gridSize=20)
dfC = eda.compareDF(umpAll_dfsG['Post'], umpAll_dfsG['Reg'])
eda.zoneHexViz(umpAll_dfsG['Post'], dfC, title1 = 'All Umpires - Post Season Games',
title2 = 'All Umpires - Post Season Compared to Regular Season Games', diff2 = True, gridSize=18)
Now that we have explored the umpires as a whole through different game situations, we can begin looking at individual umpires. To do this we will calculate percentage differences between individual umpires and the collective whole using the groupby dataframes. By binning all of the data we created 1080 data points per dataframe for every umpire in every scenario. This allows us to normalize the comparisons even though every umpire has a different number of pitches in each bin.
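As a worked example of the percentage difference, the sketch below applies the formula 100 * (umpire mean - all-umpire mean) / all-umpire mean to made-up per-bin values. The helper name pct_diff is hypothetical.

```python
from statistics import mean

def pct_diff(ump_values, all_values):
    """Percentage difference of the umpire's mean from the all-umpire mean."""
    ump_mean = mean(ump_values)
    all_mean = mean(all_values)
    return round(100 * (ump_mean - all_mean) / all_mean, 3)

# Made-up per-bin values standing in for a grouped column.
all_bins = [0.5, 0.5, 0.5]      # mean 0.5
generous = [0.5, 0.75, 1.0]     # mean 0.75
strict = [0.25, 0.25, 0.25]     # mean 0.25

print(pct_diff(generous, all_bins))
print(pct_diff(strict, all_bins))
```

A positive value means the umpire's binned mean sits above the all-umpire mean for that scenario, and a negative value means it sits below.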
# creating new dictionary to house the umpires, keys, and percentage values
ump_dfsG_mean = {}
for ump, varDF in ump_dfsG.items():
    ump_dfsG_mean[ump] = {}
    for var in varDF.keys():
        ump_dfsG_mean[ump][var] = round(100*((ump_dfsG[ump][var]['typeExp'].mean()-
                                              umpAll_dfsG[var]['typeExp'].mean())/
                                             umpAll_dfsG[var]['typeExp'].mean()),3)
# now that we have a dictionary with all the information, we can convert it to a dataframe
# using dict comprehension and some other pandas functionality
dfUmp = pd.concat({k: pd.DataFrame.from_dict(v, 'index') for k, v in ump_dfsG_mean.items()}, axis=0)
dfUmp = dfUmp.unstack(level=-1)
dfUmp.columns = dfUmp.columns.get_level_values(1)
dfUmp.sort_values(by='Overall',ascending=False, inplace=True)
dfUmp.reset_index(inplace = True)
dfUmp.rename(columns={'index': 'umpname'}, errors='raise', inplace=True)
display(dfUmp)
display(dfUmp.describe())
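For a nested dictionary of plain numbers, a shorter route to the same kind of table is DataFrame.from_dict with orient="index". The sketch below uses made-up umpire names and percentages; it is an alternative shown for comparison, not the approach used in this notebook.

```python
import pandas as pd

# Made-up percentages shaped like ump_dfsG_mean: umpire -> scenario -> value.
nested = {
    "Ump A": {"Overall": 1.5, "StandL": -0.2},
    "Ump B": {"Overall": -3.0, "StandL": 0.7},
}

# orient="index" makes the outer keys the rows and the inner keys the columns.
table = (
    pd.DataFrame.from_dict(nested, orient="index")
      .rename_axis("umpname")
      .sort_values(by="Overall", ascending=False)
      .reset_index()
)
print(table)
```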
Looking at the above dataframe, we can see which umpires are more likely or less likely to call strikes in different scenarios. These numbers tell us which umpire names and column names to use as keys into our dictionaries (ump_dfsG and ump_dfsC, which share the keys of ump_dfsG_mean) to see the corresponding visualization. As a demo we show Bill Miller (most aggressive overall strike zone) and Tom Hallion (most conservative overall strike zone). As expected, Bill Miller has a lot more red in his comparison plot than Hallion, who has more blue.
# Bill Miller - Most Aggressive Strike Calling Umpire
eda.zoneHexViz(ump_dfsG['Bill Miller']['Overall'], ump_dfsC['Bill Miller']['Overall'],
title1 = 'Bill Miller - Overall', title2 = 'Compared with All Umpires - Overall', diff2 = True, gridSize=15)
# Tom Hallion - Most Conservative Strike Calling Umpire
eda.zoneHexViz(ump_dfsG['Tom Hallion']['Overall'], ump_dfsC['Tom Hallion']['Overall'],
title1 = 'Tom Hallion - Overall', title2 = 'Compared with All Umpires - Overall', diff2 = True, gridSize=15)
The visualizations below show the differences between the most aggressive and most conservative umpires in the game of baseball. As you can see in the images below, there is a significant difference in the sizes and shapes of the strike zones of these two umpires. Using the built-in comparison visuals, we can easily see that difference quantified with respect to either umpire when compared to the other.
eda.zoneHexViz(ump_dfsG['Bill Miller']['Overall'], ump_dfsG['Tom Hallion']['Overall'], title1 = 'Bill Miller - Overall',
title2 = 'Tom Hallion - Overall', gridSize=14)
dfC = eda.compareDF(ump_dfsG['Bill Miller']['Overall'], ump_dfsG['Tom Hallion']['Overall'])
eda.zoneHexViz(ump_dfsG['Bill Miller']['Overall'], dfC, title1 = 'Bill Miller - Overall',
title2 = 'Compared to Tom Hallion', diff2 = True, gridSize=14)
dfC = eda.compareDF(ump_dfsG['Tom Hallion']['Overall'],ump_dfsG['Bill Miller']['Overall'])
eda.zoneHexViz(ump_dfsG['Tom Hallion']['Overall'], dfC, title1 = 'Tom Hallion - Overall',
title2 = 'Compared to Bill Miller', diff2 = True, gridSize=14)
We now have the ability to quickly visualize how an umpire compares to his peers. The cell below iterates through all scenarios for Joe West to give an idea of how he calls games.
# setting the viz up for Joe West
for key in ump_dfs['Joe West'].keys():
    eda.zoneHexViz(ump_dfs['Joe West'][key], ump_dfsC['Joe West'][key], title1 = 'Joe West - '+key,
                   title2 = 'Compared to All Umpires - '+key, gridSize=10, diff2 = True)
Creating visualizations inside of a notebook is great for exploring, but we also wanted the ability to create documents that could be distributed to anyone interested. The vizGenerator built into our eda module allows that. You can change the report type from raw to comparison with one argument. The visualizations can also be suppressed or enabled depending on whether you want a document or embedded visualizations. The vizGenerator automatically iterates through the available umpire keys in the dictionary passed in. It also walks the nested dictionaries, which creates a report for every umpire similar to the visuals shown above for Joe West.
### create a dictionary with all the keys and values that we want in the report below
reportViews = {}
reportViews['Overall'] = 'Overall Strike Zone'
reportViews['StandL'] = 'Strike Zone Left handed Batters'
reportViews['StandR'] = 'Strike Zone Right handed Batters'
reportViews['ThrowL'] = 'Strike Zone Left handed Pitchers'
reportViews['ThrowR'] = 'Strike Zone Right handed Pitchers'
reportViews['Begin'] = 'Strike Zone at the Beginning of the game (<=2 innings)'
reportViews['End'] = 'Strike Zone at the End of the game (>=8 innings)'
reportViews['Short'] = "Strike Zone for shorter batters (less than 5'10)"
reportViews['Tall'] = "Strike Zone for Taller batters (greater than 6'2)"
reportViews['Day'] = 'Strike Zone for games happening during day time'
reportViews['Night'] = 'Strike Zone for games happening during Night time'
reportViews['Close'] = 'Strike Zone for games that are tight (diff in score <3)'
reportViews['Open'] = 'Strike Zone for games where one team has lead >3'
reportViews['Reg'] = 'Strike Zone for regular season games'
#reportViews['Post'] = 'Strike Zone for post season games' leaving off post season
# passing in the dictionaries (ump_dfsG and ump_dfsC) to iterate through and generate the report.
# these reports are word documents and a statement is printed to show successful completion.
# the report is housed in the /Report folder within this directory
eda.vizGenerator(ump_dfsG, ump_dfsC, vizSections=reportViews, comparison=True, generateReport=True)